Abstract

The Saudi National Assessment Centre administers the Computer Science Teacher Test for teacher certification. 
The aim of this study is to explore gender differences in candidates’ scores, and investigate dimensionality, 
reliability, and differential item functioning using confirmatory factor analysis and item response theory. The 
confirmatory factor analysis results for 6 371 examinees’ scores of 66 multiple-choice items when grouped into 
three content domains showed that the test data were unidimensional (ability, trait). The domains were highly 
correlated (0.883 to 0.949) within this dimension. Data reliability estimated through latent variable modelling 
was acceptable at 0.848. The gender DIF analysis flagged 13 items, five against males and eight against
females, indicating some balance in the direction of DIF. The study results confirm
the validity of the Computer Science Teacher Test and support further refinement of multiple forms of the test. 

Keywords: teacher assessment, Saudi Arabia, confirmatory factor analysis, differential item functioning, gender 
differences 

1. Introduction 

Teacher assessment is used for measuring and supporting pre-teacher education outcomes and teachers’ 
professional development. In a review, DeLuca and Bellara (2013) found a multitude of teacher assessment 
standards used by national educational authorities with numerous assessment literacy measures. Further, the 
authors noted shifts in teacher education curricular concepts together with evolution in the national measures for 
student outcomes. In another review, Blomeke and Delaney (2014) noted that whilst teacher assessment studies 
from North America and other English-speaking countries focussed on internal assessment systems and practices 
and contained some cross-country comparisons, a trend to cultural comparison of teacher assessment systems 
had not yet emerged. 

This paper first introduces the Saudi educational environment, and this is followed by a short literature review. 
The methodology and results are presented and discussed, and conclusions drawn. 

2. Saudi Education System 

The Ministry of Education is the sole authority for education in Saudi Arabia, providing a free education for all 
Saudi students through to higher education. The Ministry also oversees a small educational private sector,
generally for expatriates. Saudi schools are gender segregated, thus there is a significant number of men in the 
profession. The World Bank (2016) reported that in 2014 there were 761 737 trainees and teachers, pre-school to 
secondary school, of whom 52 per cent were women. 

Teachers in Saudi Arabia are viewed in two ways. Because of their early association with mosques they are
admired; however, the secular education system set up after the 1932 declaration of the Kingdom foundered for
decades on the perception of teachers as over-worked and underpaid (Al-Rasheed, 2010). From the 1950s the
corporation Saudi Aramco (2016) assisted the government in establishing schools to alleviate issues with 
illiteracy, eventually building 139 schools. Initially, boys only were permitted an education due to traditional 
beliefs in the conservative society, but by 1960 the first primary school for girls was opened with one student 
(Bowen, 2015). 

As oil revenues became available, Saudi Arabia was in a better position to plan for the future, and in 1970 set in 
place the first of its five-year economic plans. Education was a priority, both for literacy for the population and 
to provide the nascent public sector with Saudis to replace the largely foreign workforce (Alshahrani & Alsadiq,
2014). However, the population growth in the late 20th century surpassed the Ministry of Education’s ability to
provide all Saudis with a quality education, and by the 6th economic development plan (1995-1999) a
concentrated effort was made to improve the ‘Saudisation’ of the country’s workforce, that is, replacing skilled
expatriates with skilled Saudis. This emphasis on education continues today (Ahmed, 2016).

3. Teacher Education 

Teaching in 20th century Saudi schools was criticised as being conservative and didactic (Norton & Syed, 2003).
Teachers’ education was expected to be at bachelor degree standard, but due to the pressure of population growth,
this was not enforced and diplomates were accepted. Pedagogical practices were didactic: teachers explained
principles of the curriculum but did not engage the students, who were thus passive learners, recording their
lessons and memorising for examinations. A report for the Ministry of Education recommended, inter alia,
improved teacher education, and in 2004 the Ministry embarked on a decade-long plan (Tatweer) to improve the 
quality of education in the Kingdom (International Bureau of Education, 2011). 

As part of Tatweer’s emphasis on teacher education, competencies were prepared for pedagogical, numeracy and 
literacy skills; however, these were not adequately administered and did not achieve the standards expected 
(Alzaydi, 2011; Alsharif, 2011; Al Shannag, Tairab, Dodeen, & Abdel-Fattah, 2013). Elyas and Pickard (2013)
stated that teacher outcomes were challenged by variables in students’ backgrounds, the rise of educational 
technology, and universities’ hierarchies. Shortfalls in teacher competencies had external effects for the Ministry 
of Education. Comparing Saudi and Singapore results for grade 8 students from a 2007 international study 
(Trends in Mathematics and Science Study), Al Shannag et al. (2013) found that the Saudi teachers retained their
teacher-centric style, whilst the more successful Singaporean teachers practised a student-centric educational 
system. 

The Ministry of Education responded to these reports by implementing a change in focus from teacher to student. 
In 2010 the National Centre for Assessment developed a new teacher assessment framework, the National
Professional Teacher Standards. The framework comprises 12 standards in two groups. The first group is
pedagogical: professional knowledge, promoting learning, supporting learning, and professional responsibility
(Al-Saud & Al-Sadaawi, 2014). The second group comprises the subject-specific teaching standards for 25
curricular courses. The standards guide teacher licensing examinations, identify training needs for new
teachers, and set the quality of teaching programs.

As an example, one of the courses is the Computer Science Teacher Test (CSTT) for secondary school. It
consists of three domains: computer and math, engineering and science; computer applications; and computer
and education. Based on the 2010 standards, the test has been administered to 20 028 candidates, of whom 37
per cent were female (Ministry of Education, 2016). 

This study investigates the validity of the test data by examining their dimensionality and key features such as 
reliability, and differential item functioning on gender in the framework of item response theory. 

4. Literature Review 

Confirmatory factor analysis examines relationships among measurement data, that is, test results or indicators,
and is used to identify latent variables (factors) (Brown, 2015). Unlike exploratory factor analysis, confirmatory
factor analysis is hypothesis-based, thus all aspects of the model are pre-specified. This form of analysis is used
to ‘verify the number of underlying dimensions of the instrument (factors) and the pattern of item-factor
relationships (factor loadings)’ (Brown, 2015, p. 1). Netemeyer, Bearden, and Sharma (2003) stated that
confirmatory factor analysis can be used to assess dimensionality (fit, correlated measurement errors, degree
of cross-loading).

In designing tests and measures which produce large data such as the Computer Science Teacher Test, 
dimensionality refers to the homogeneity of items and sub-items. Netemeyer, Bearden, and Sharma (2003) 
explained that a unidimensional measure indicates a single latent variable that accounts for item data (responses), 
whereas a multidimensional measure has more than one latent variable among the data. In designing such tests, a 
unidimensional internal structure is a step towards establishing reliability (consistency between items) and 
validity (consistency between the measure’s constructs). Whilst unidimensionality is used in confirmatory factor 
analysis, it is also a fundamental assumption in item response theory (Deng, Wells, & Hamilton, 2008). 

In longitudinal research, the analysis of measurement invariance of latent constructs is important as scores may 
vary over time. For example, in education, repetitive examination of cohorts of students determines the progress 
of individuals over the course of their education or is used to compare group scores. Measurement invariance 
was predated by Joreskog’s (1971, p.409) observation of ‘similarities and differences in factor structures 
between different groups’. Joreskog posited that parameters in factor analysis models (factor variances, factor
loadings, factor covariance and unique variances) may be constrained, or assigned an arbitrary value. 
Measurement invariance was introduced by Byrne, Shavelson, and Muthen (1989) using sensitivity analyses for 
stability in baseline models, ‘determining partially invariant measurement parameters, and . . . testing for the 
invariance of factor covariance and mean structures, given partial measurement invariance’ (Byrne et al. 1989, p. 
456). Measurement invariance, or measurement equivalence, thus establishes that each iteration measures the 
same construct (latent variable). 

Reliability concerns whether the effect being investigated persists from one sample to another.
Raykov (2004) and Raykov, Dimitrov, and Asparouhov (2010) used latent variable modelling for measurement
invariance and reliability. Raykov (2004, 2012) argued that coefficient alpha does not estimate scale reliability at
population levels, and proposed another reliability coefficient model based on scale reliability rather than the
restrictions of Cronbach’s α (Cronbach, 1951), which requires that the factor loadings of all items are
equal. More recently, Raykov, Gabler, and Dimitrov (2016, p. 1) established a latent variable modelling
procedure ‘for point and interval estimation of the difference between the maximal and scale criterion validity
coefficients’. This overcomes issues regarding the use of unidimensional multicomponent measures.

Criterion-related validity is one aspect of validating an instrument, referring to an item on a questionnaire 
actually measuring the intended outcome (Lodico, Spaulding, & Voegtle, 2010). The others include face validity
(relevance of items to intent) and content validity (items relevant to the content being measured).
Criterion-related validity reflects the relationship between two scores on two different measures, and tests
whether the outcome from the measure, its performance, can be predicted (Lodico et al., 2010). Raykov’s (2007)
latent variable modelling approach is used in this research for reliability and criterion validity.

Item response theory, a paradigm for the measurement of items in relation to the latent variable, is used 
extensively in education tests, including test construction, estimating ability and score reporting (Deng et al., 
2008). Item response models take into consideration the degree of difficulty of each item in scaling items. Item 
response theory has, as noted, an assumption of unidimensionality (Deng et al. 2008). 
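
As an illustration of how an item response model relates item difficulty to the latent variable, the following
minimal sketch uses the one-parameter (Rasch) logistic model. The choice of model is an assumption made here
for illustration only, since the study does not state which item response model was fitted; the function name
and the example values are hypothetical.

```python
import math


def rasch_probability(theta: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch (1PL) model.

    theta is the examinee's position on the latent trait and difficulty is
    the item's location on the same scale (hypothetical values below).
    """
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


# An examinee whose ability equals the item difficulty has an even chance.
p_matched = rasch_probability(theta=0.0, difficulty=0.0)

# The same examinee is less likely to answer a harder item correctly.
p_harder = rasch_probability(theta=0.0, difficulty=1.0)
```

Because a single latent variable (theta) drives every item probability, a model of this kind presupposes the
unidimensionality that the confirmatory factor analysis is used to check.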

Differential item functioning refers to the potential for bias in the test items which could skew data (be unfair) to 
sub-groups based on gender, race or age (Strobl, Kopf, & Zeileis, 2010). The bias may exist in a single item, or 
goodness-of-fit tests may show a trend, or a likelihood of bias among the variables. 

There is a wide variety of statistical techniques for evaluating differential item functioning in both
dichotomous and polytomous items (Gomez-Benito, Hidalgo, & Zumbo, 2013; Hambleton & Swaminathan, 2013;
Sireci & Rios, 2013). Among these, that of Mantel and Haenszel (1959) remains a reference technique (Guilera,
Gomez-Benito, Hidalgo, & Sanchez-Meca, 2013). Strobl et al. (2010) explained that testing can be restricted to
specific pre-defined sub-groups, which supports interpretation but leaves open the possibility of unexplained
bias. At the other extreme, all item parameter differences can be tested for bias among all possible sub-groups,
which leads to interpretation difficulty. Strobl et al. proposed a semi-parametric model using recursive
partitioning to address this.

5. Methodology 

The data were the scores of 6 371 examinees on 66 multiple-choice items on the Saudi Computer Science
Teacher Test. Each item had four response options, only one of which was correct, so each item was scored
1 for a correct response and 0 otherwise. The test items were classified as follows:

Domain 1: Computer and math, engineering and science (35 items). 

Domain 2: Computer applications (12 items). 

Domain 3: Computer and education (19 items). 

Confirmatory factor analysis was used to test the validity of hypothesised models of the test and its three 
content-specific domains. The first question concerned the dimensionality of the data. Three different 
confirmatory factor analysis models were tested and compared on data fit with the teacher test scores: model A: a 
one-factor model; model B: a three-factor model with the three content-specific domains as correlated latent 
factors; and model C: a three-factor model with the three content-specific domains as uncorrelated latent factors. 
The models were tested for data fit using the program Mplus (Muthen, 2016). In the Mplus syntax for the three 
models, the factor indicators (test items) were declared as categorical variables because the item scores are 
dichotomous (0/1). Thus the factor analysis was based on the tetrachoric correlations (i.e., observed values are 
dichotomous) for the scores of the test items. This avoided issues using Pearson correlations for factor analysis 
of categorical variables. The analysis of test data for item response theory used the program Xcalibre 4 
(Assessment Systems, 2016). 

The score reliability was estimated through the use of a latent variable modelling (LVM) approach taking into
account the binary nature of the item scores (Dimitrov, 2012; Raykov, 2007; Raykov et al., 2010). The
congeneric model for latent normal variables $Y_1^*, Y_2^*, \ldots, Y_p^*$ assumed to underlie a set of binary
items $Y_1, Y_2, \ldots, Y_p$ according to Joreskog (1971) is:

$$Y_i^* = \lambda_i \eta + \varepsilon_i \qquad (1)$$

where $\eta$ is a common latent factor with a variance set equal to 1, $\lambda_i$ are factor loadings,
$\varepsilon_i$ are latent disturbances, and the probability of correct response on $Y_i$ is given by the area
under the standard normal curve to the right of a pertinent threshold $\kappa_i$ ($i = 1, 2, \ldots, p$). Under
this model, the score reliability, $\rho$, is estimated through the following equation (e.g., Bollen, 1989):

$$\rho = \frac{(\lambda_1 + \lambda_2 + \cdots + \lambda_p)^2}{(\lambda_1 + \lambda_2 + \cdots + \lambda_p)^2 + \mathrm{VAR}(\varepsilon_1) + \mathrm{VAR}(\varepsilon_2) + \cdots + \mathrm{VAR}(\varepsilon_p)} \qquad (2)$$

where the numerator represents the true-score variance and the denominator represents the total variance (i.e.,
the sum of true variance and error variance).

Cronbach’s α for internal consistency (reliability) was also computed; however, it underestimated the
reliability obtained from the latent variable modelling approach, confirming the literature review discussion.
Further, under the congeneric measurement model in equation 1, the assumption of tau-equivalency is met when
the factor loadings are equal, $\lambda_1 = \lambda_2 = \cdots = \lambda_p$ (e.g., Joreskog, 1971).
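
The relationship between the latent variable reliability in equation 2 and Cronbach’s α can be illustrated with
a minimal sketch. It assumes a one-factor model with the factor variance fixed at 1 and works on the scale of
the latent responses; the function names and the loadings are hypothetical and are not the estimates from this
study.

```python
def scale_reliability(loadings, error_variances):
    """Equation 2: true-score variance divided by total variance."""
    true_variance = sum(loadings) ** 2
    return true_variance / (true_variance + sum(error_variances))


def coefficient_alpha(loadings, error_variances):
    """Cronbach's alpha computed from the model-implied covariance matrix."""
    k = len(loadings)
    # Item variances are lambda_i^2 + VAR(eps_i); covariances are lambda_i * lambda_j,
    # so the variance of the sum score is (sum of loadings)^2 + sum of error variances.
    item_variances = [lam * lam + err for lam, err in zip(loadings, error_variances)]
    total_variance = sum(loadings) ** 2 + sum(error_variances)
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical congeneric items (unequal loadings, unit item variances).
congeneric = [0.9, 0.5, 0.3]
congeneric_errors = [1 - lam * lam for lam in congeneric]
rho_congeneric = scale_reliability(congeneric, congeneric_errors)
alpha_congeneric = coefficient_alpha(congeneric, congeneric_errors)

# Hypothetical tau-equivalent items (equal loadings).
equal = [0.5, 0.5, 0.5, 0.5]
equal_errors = [0.75, 0.75, 0.75, 0.75]
rho_equal = scale_reliability(equal, equal_errors)
alpha_equal = coefficient_alpha(equal, equal_errors)
```

With unequal loadings α falls below the reliability from equation 2, whereas with equal loadings the two
coefficients coincide, which is the tau-equivalency condition described above.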

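In differential item functioning analyses, groups are compared on item performance after adjusting for overall
performance on the measured trait (Hambleton & Swaminathan, 2013). Under the null hypothesis of no DIF, the
Mantel-Haenszel statistic follows a chi-square distribution with one degree of freedom. Under this procedure,
an effect size estimate based on the common odds ratio $\alpha_{MH}$ is expressed as

$$\alpha_{MH} = \frac{\sum_{j=1}^{K} A_j D_j / N_j}{\sum_{j=1}^{K} B_j C_j / N_j} \qquad (3)$$

where, in the usual Mantel-Haenszel notation, $A_j$ and $B_j$ are the numbers of reference-group examinees at
score level $j$ who answered the item correctly and incorrectly, $C_j$ and $D_j$ are the corresponding numbers
for the focal group, $N_j$ is the total number of examinees at score level $j$, and $K$ is the number of score
levels.

Holland and Thayer (1988) proposed a logarithmic transformation of $\alpha_{MH}$ for interpretive purposes, with
the aim of obtaining a symmetrical scale in which a zero value indicates an absence of DIF, a negative value
indicates that the item favours the reference group over the focal group, and a positive value indicates DIF in
the opposite direction. This transformation, the delta metric, is expressed as

$$\Delta\alpha_{MH} = -2.35 \ln(\alpha_{MH}) \qquad (4)$$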
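
The two Mantel-Haenszel quantities can be sketched in a few lines. The sketch assumes the conventional 2 x 2
layout at each matched score level (A and B are reference-group correct and incorrect counts, C and D the
focal-group counts); the function names and the example counts are hypothetical and are not data from this
study.

```python
import math


def mh_common_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio (equation 3).

    Each stratum is a tuple (A, B, C, D) for one matched score level:
    A, B = reference-group correct, incorrect;
    C, D = focal-group correct, incorrect.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


def mh_delta(alpha_mh):
    """Delta-metric transformation of the common odds ratio (equation 4)."""
    return -2.35 * math.log(alpha_mh)


# Two hypothetical score levels in which the reference group does better.
example_strata = [(40, 10, 30, 20), (30, 20, 20, 30)]
example_alpha = mh_common_odds_ratio(example_strata)
example_delta = mh_delta(example_alpha)  # negative: item favours the reference group
```

Applying `mh_delta` to the odds ratio reported for item 1 in the appendix (0.5911) reproduces the tabulated
delta value of about 1.2355, up to rounding of the reported odds ratio.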

6. Results 

The test results for data fit of the three models (A, B, and C) are summarised in Table 1. 


Table 1. Data fit of three CFA models from Teacher Test Data

                                                                          90% CI for RMSEA
CFA model                    χ²         df     CFI    TLI    WRMR    RMSEA    Lower limit   Upper limit
A: one factor                4851.829   2079   .923   .920   1.471   .014     .014          .015
B: 3 correlated factors      4752.691   2076   .926   .923   1.455   .014     .014          .015
C: 3 uncorrelated factors    No convergence



The assessment of model fit is based on the evaluation of the following goodness-of-fit indices, with cut-off
values for an excellent fit as follows:

• Comparative fit index: CFI > 0.95; Incremental Fit Index: IFI > 0.95;

• Standardised root mean square residual: SRMR = 0.00 (SRMR < 1.00 for an adequate fit);

• Root mean square error of approximation: RMSEA = 0.00 (RMSEA < 0.05 for an adequate data fit) (Hu &
Bentler, 1999; Marsh, Wen, & Hau, 2004).

The results in Table 1 indicate that the one-factor model (model A) provides an adequate data fit. A very slight
improvement in data fit is obtained with model B, where the correlations between the three domains of the test 
are taken into account. These correlations were found to be very high, ranging from 0.883 to 0.949 (see Table 2). 


Table 2. Correlations among Teacher Test Domains

Domain                                         Domain 1   Domain 2   Domain 3
1: Computer & math, engineering and science    1.000
2: Computer applications                       0.909      1.000
3: Computer & education                        0.883      0.949      1.000



The near-identical fit of models A and B in Table 1, together with the high correlations among the domains
under model B (Table 2), indicates that the teacher test data are essentially unidimensional. Model C, where the
three test domains are assumed uncorrelated, did not converge with the test data.
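
The degrees of freedom reported in Table 1 can be checked with a short calculation. The sketch assumes the usual
parameter count for binary indicators, in which the sample statistics are the item thresholds and the pairwise
tetrachoric correlations, and the free parameters are one threshold and one loading per item plus any factor
correlations, with factor variances fixed at 1. The study does not list its free parameters, so this count and
the function name are assumptions.

```python
def cfa_degrees_of_freedom(n_items: int, n_factor_correlations: int) -> int:
    """Degrees of freedom for a CFA of binary items under the stated assumptions."""
    # Sample statistics: one threshold per item plus all pairwise
    # tetrachoric correlations.
    n_statistics = n_items + n_items * (n_items - 1) // 2
    # Free parameters: one threshold and one loading per item, plus any
    # freely estimated factor correlations (factor variances fixed at 1).
    n_parameters = 2 * n_items + n_factor_correlations
    return n_statistics - n_parameters


df_model_a = cfa_degrees_of_freedom(66, 0)  # one factor
df_model_b = cfa_degrees_of_freedom(66, 3)  # three correlated factors
```

Under these assumptions the calculation gives 2079 and 2076, matching the degrees of freedom reported for
models A and B, so the two models differ only by the three factor correlations.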

The standardised item factor loadings and thresholds of the 66 items of the test under the one-factor CFA model 
(model A) are provided in the appendix. The analysis of the sample showed 60.5% were females and 39.5% 
males, which differed from the overall population. All factor loadings were statistically significant (p < .001), 
with the exception of the loadings for item 45 (p = .428) and item 65 (p = .340).

The reliability of the data was estimated by latent variable modelling (LVM) (equations 1 and 2). The
reliability estimate was 0.848, with a 95% confidence interval of (0.842, 0.854). Cronbach’s α was 0.749
and thus underestimated the LVM reliability (α < 0.848), as discussed above.

The data were tested for DIF across gender using the two Mantel-Haenszel statistics, $\alpha_{MH}$ and
$\Delta\alpha_{MH}$ (equations 3 and 4), the results of which are provided in the appendix. For interpretation
of these results, $\alpha_{MH}$ is reported with a z-statistic and its p-value, where DIF is signalled by a
statistically significant z-value (p < .05), with DIF against males if z > 0 and DIF against females if z < 0.
The absolute values of the statistic $\Delta\alpha_{MH}$ are used to classify DIF into three categories:
category A, negligible DIF, when $|\Delta\alpha_{MH}| < 1.0$; category B, moderate DIF, when
$1.0 \le |\Delta\alpha_{MH}| < 1.5$; and category C, large DIF, when $|\Delta\alpha_{MH}| \ge 1.5$ (Holland &
Thayer, 1988).
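
The decision rule described above can be written as a small function. This is a sketch of the stated criteria
rather than the software used in the study; the function name is hypothetical, boundary values of exactly 1.0
and 1.5 are assigned to the higher category as an assumption, and the direction of DIF is returned only for
categories B and C, as in the appendix table.

```python
def classify_dif(z: float, p_value: float, delta: float):
    """Return (DIF category, group the item works against) for one item."""
    if p_value >= 0.05:
        return ("No DIF", None)  # z-statistic not significant
    magnitude = abs(delta)
    if magnitude < 1.0:
        return ("A (negligible)", None)
    category = "B (moderate)" if magnitude < 1.5 else "C (large)"
    # Positive z signals DIF against males, negative z against females.
    against = "males" if z > 0 else "females"
    return (category, against)


# Values taken from the appendix for items 21, 9, 2 and 44.
item_21 = classify_dif(z=7.1830, p_value=0.0000, delta=1.5133)
item_9 = classify_dif(z=-6.4345, p_value=0.0000, delta=-1.4406)
item_2 = classify_dif(z=-1.3610, p_value=0.1740, delta=-0.3332)
item_44 = classify_dif(z=-5.2729, p_value=0.0000, delta=-0.9916)
```

Item 44 shows why both criteria are needed: its z-statistic is highly significant, yet its delta value is below
1.0 in absolute terms, so it is treated as negligible DIF.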

Based on these criteria, the results in the appendix indicate that DIF is signalled for 13 items, of which 9
items fall in category B for moderate DIF (6 against females and 3 against males) and 4 items in category C for
large DIF (2 against females and 2 against males). The remaining 53 items are either not signalled for DIF or
were categorised as A, negligible DIF, acceptable for the purposes of this study (see Zwick & Ercikan, 1989).

7. Conclusion 

This study examined the factor structure of the Computer Science Teacher Test and its psychometric 
characteristics to validate interpretations and decisions about certification of teachers in Saudi Arabia. The 
results showed that the test items are essentially unidimensional, confirming the use of item response modelling. 

The results in this study support the validity of interpretations and decisions related to certification of teachers in 
Saudi Arabia based on their computer test scores. This outcome should guide test developers and researchers at
the National Assessment Centre in the further evolution of the Computer Science Teacher Test.

Appendix

Testing for gender on teacher test items (Differential Item Functioning) 


Item   α_MH     z-statistic   p-value   Δα_MH     DIF against   DIF category
1      0.5911    6.9970       0.0000     1.2355   males         B (moderate)
2      1.1523   -1.3610       0.1740    -0.3332                 No DIF
3      1.2330   -2.7190       0.0065    -0.4922                 A (negligible)
4      0.9865    0.1230       0.9024     0.0321                 No DIF
5      0.9349    0.7359       0.4618     0.1581                 No DIF
6      0.7298    3.6550       0.0003     0.7401                 A (negligible)
7      0.7343    3.3002       0.0010     0.7258                 A (negligible)
8      1.1017   -1.2138       0.2248    -0.2275                 A (negligible)
9      1.846    -6.4345       0.0000    -1.4406   females       B (moderate)
10     0.8479    2.2229       0.0262     0.3879                 A (negligible)
11     0.9725    0.3688       0.7123     0.0655                 No DIF
12     0.7039    4.7878       0.0000     0.8250                 A (negligible)
13     1.2737   -2.6286       0.0086    -0.5685                 A (negligible)
14     1.3073   -3.3718       0.0007    -0.6296                 A (negligible)
15     0.6606    5.1835       0.0000     0.9742                 A (negligible)
16     0.6828    5.0853       0.0000     0.8968                 A (negligible)
17     0.8416    2.0918       0.03650    0.4051                 A (negligible)
18     0.9132    1.0142       0.3105     0.2134                 No DIF
19     0.7786    2.6756       0.0075     0.5880                 A (negligible)
20     0.7304    3.6901       0.0002     0.7383                 A (negligible)
21     0.5252    7.1830       0.0000     1.5133   males         C (large)
22     0.9829    0.2220       0.8243     0.0406                 No DIF
23     1.0071   -0.0882       0.9297    -0.0165                 No DIF
24     0.7974    2.6956       0.0070     0.5322                 A (negligible)
25     1.1605   -1.8146       0.0696    -0.3499                 No DIF
26     0.6043    6.5963       0.0000     1.1836   males         B (moderate)
27     0.7716    3.4417       0.0006     0.6093                 A (negligible)
28     0.6076    5.6195       0.0000     1.1708   males         B (moderate)
29     0.8163    2.6602       0.0078     0.4769                 A (negligible)
30     0.7745    3.1275       0.0018     0.6004                 A (negligible)
31     0.8153    2.6957       0.007      0.4799                 A (negligible)
32     1.3394   -2.2987       0.0215    -0.6867                 A (negligible)
33     1.6394   -5.4650       0.0000    -1.1617   females       B (moderate)
34     1.0716   -0.7715       0.4404    -0.1624                 No DIF
35     1.0406   -0.4418       0.6586    -0.0936                 No DIF
36     1.6061   -4.6009       0.0000    -1.1135   females       B (moderate)
37     2.0133   -8.3547       0.0000    -1.6445   females       C (large)
38     1.5842   -5.2056       0.0000    -1.0811   females       B (moderate)
39     0.8059    1.9446       0.0518     0.5070                 No DIF
40     0.6878    4.3058       0.0000     0.8796                 A (negligible)
41     0.8392    2.2340       0.0255     0.4121                 A (negligible)
42     1.2712   -3.0046       0.0027    -0.5639                 A (negligible)
43     0.7265    4.3195       0.0000     0.7508                 A (negligible)
44     1.5249   -5.2729       0.0000    -0.9916                 A (negligible)
45     0.6903    3.2308       0.0012     0.8709                 A (negligible)
46     0.7761    3.0779       0.0021     0.5958                 A (negligible)
47     0.7401    3.8934       0.0001     0.7072                 A (negligible)
48     0.9653    0.4565       0.6480     0.0830                 No DIF
49     0.8060    2.9540       0.0031     0.5068                 A (negligible)
50     1.4009   -4.3295       0.0000    -0.7922                 A (negligible)
51     1.0747   -0.6942       0.4875    -0.1692                 No DIF
52     1.6379   -4.2836       0.0000    -1.1596   females       B (moderate)
53     1.1705   -1.8934       0.0583    -0.3699                 No DIF
54     1.0602   -0.7830       0.4336    -0.1373                 No DIF
55     0.7934    3.1834       0.0015     0.5438                 A (negligible)
56     0.8003    2.9867       0.0028     0.5234                 A (negligible)
57     1.0537   -0.6966       0.4860    -0.1228                 No DIF
58     1.0834   -1.0354       0.3005    -0.1883                 No DIF
59     0.8656    1.9848       0.0472     0.3392                 A (negligible)
60     0.9149    0.4148       0.6783     0.2090                 No DIF
61     0.7785    3.4656       0.0005     0.5883                 A (negligible)
62     0.5101    9.0260       0.0000     1.5820   males         C (large)
63     1.1264   -1.4038       0.1604    -0.2797                 No DIF
64     1.6543   -6.0123       0.0000    -1.1830   females       B (moderate)
65     2.1570   -7.7439       0.0000    -1.8065   females       C (large)
66     1.0573   -0.5265       0.5986    -0.1308                 No DIF


Note. The items in category A (negligible DIF) are considered to function acceptably for the purposes of this study (e.g., Zwick & Ercikan, 1989).